[cudax] Implement `cudax::coop::reduce` for warp groups within a block by davebayer · Pull Request #9258 · NVIDIA/cccl

davebayer · 2026-06-04T12:32:05Z

coderabbitai · 2026-06-04T12:39:18Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR implements cudax::coop::reduce for arbitrary warp-groupings within a CUDA block (fixes #9160). It adds a warp-group reduction path that performs a two-stage reduction across warps inside a block, adjusts the public wrapper to take the reduction functor by value, and adds Catch2 tests validating multi-warp reductions. Several supporting group/mapping/synchronizer fixes were included to ensure correct rank/mapping and barrier initialization when units are not at thread level.

Core Implementation Changes

cudax/include/cuda/experimental/__coop/reduce.cuh
- Added an overload of __reduce_impl constrained to groups with unit_type == warp_level and level_type == block_level.
- Two-stage reduction:
  - Per-warp reductions using cub::WarpReduce with per-warp TempStorage; each warp writes one partial into shared memory.
  - Synchronize, then reduce per-warp partials using a root-warp cub::WarpReduce; only the group root returns a cuda::std::optional containing the final reduced value, other threads return nullopt.
- Public wrapper signature changed from reduce(_Group, _Tp (&)[_Np], _RedFn&&) → reduce(_Group, _Tp (&)[_Np], _RedFn) (functor passed by value and forwarded into the impl).

Supporting Changes

cudax/include/cuda/experimental/__group/queries.cuh
- In the __rank_query_group non-matching branch, the linear rank computation now multiplies the group-unit rank by the unit count derived from __group.hierarchy() and adds the unit-local rank (uses __count_query/__rank_query from the hierarchy).
cudax/include/cuda/experimental/__group/group.cuh
- Lane-mask validation in __do_mapping made conditional on _Unit: when _Unit == thread_level assert lane inclusion and popcount constraints; otherwise require lane mask == all lanes (previously always validated per-lane).
cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
- make_instance now precomputes per-Unit thread metadata (__nthread_in_unit, __thread_rank_in_unit) and derives them from the parent hierarchy when _Unit != thread_level.
- Barrier initialization guard tightened to require mapping valid, mapping rank == 0, and __thread_rank_in_unit == 0.
- Barrier init count scaled to __mapping_result.count() * __nthread_in_unit to account for multiple threads per unit.

Tests

cudax/test/coop/reduce/warps_within_block.cu (new)
- Adds ReduceKernel and parametrized tests that create groups of nwarps_in_group warps and verify cudax::coop::reduce across multiple warps.
- Tests:
  - Integral types with integer reduction operators (plus/mul/bitwise/min/max) — exact equality.
  - Floating-point types (float/double) with FP ops (plus/mul/min/max) — relative tolerance.
  - Per-thread input sizes 1–4.
- Kernel runtime config supports block size = (nwarps_in_group + 2) * warp_size; only group-member threads participate in the reduction and only group root writes output.
cudax/test/CMakeLists.txt
- Registers new test target coop.reduce.warps_within_block.

API / ABI Surface Changes

Added internal overload: __reduce_impl(_Group, _Tp (&)[_Np], _RedFn) for warp-group reductions.
Changed public reduce signature to take the reduction functor by value (_RedFn), instead of an rvalue reference.

Review Notes / Areas to Verify

Shared-memory layout and sizing for per-warp partials and correctness of root-warp selection (ensure alignment and no bank-conflict issues).
Correctness and performance across varying warp-group sizes and SM architectures (two-stage approach and root-warp reduction).
Impact of changing public reduce functor parameter to pass-by-value on forwarding/move semantics.
Barrier initialization and mapping correctness when units contain multiple threads (ensure counts/ranks align with mapping and that lane-mask validation changes are appropriate).

important:

Walkthrough

Adds a warp-scoped two-stage cooperative reduce overload, conditions group mapping/queries on unit hierarchy and adjusts barrier init for per-unit threads, and adds parametrized CUDA tests validating multi-warp reductions across types, ops, and item counts.

Changes

Cooperative reduce for warps within block

Layer / File(s)	Summary
Group mapping lane-mask validation `cudax/include/cuda/experimental/__group/group.cuh`	Make post-mapping lane-mask assertions conditional on `_Unit`: keep inclusion/popcount checks for `thread_level`, require `lane_mask::all()` for others.
Group hierarchy ranking correction `cudax/include/cuda/experimental/__group/queries.cuh`	Compute `__unit_count` via `__count_query<_Unit,_GroupUnit>(__group.hierarchy())` and return linear rank as `__group_unit_rank * __unit_count + __unit_rank`.
Barrier synchronizer initialization fix `cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh`	Compute per-unit thread metadata for non-`thread_level`, require `__thread_rank_in_unit == 0` for barrier init, and scale init count by `__nthread_in_unit`.
Core warp-level reduce implementation `cudax/include/cuda/experimental/__coop/reduce.cuh`	Add constrained `__reduce_impl` performing per-warp `cub::WarpReduce` into shared memory, then a root-warp final `cub::WarpReduce`; only root warp/lane returns `optional` result. Change public `reduce` to take reduction functor by value and forward it.
Test validation for warp-level reduce `cudax/test/CMakeLists.txt`, `cudax/test/coop/reduce/warps_within_block.cu`	Add `coop.reduce.warps_within_block` test target and parametrized tests: `ReduceKernel`, helpers, and tests for integral and floating-point types/operators, verifying NumItems=1..4 outputs (exact for integrals, WithinRel for floats).

Assessment against linked issues

Objective	Addressed	Explanation
Implement `cudax::coop::reduce` for arbitrary warps within a block groups [`#9160`]	✅

Possibly related PRs

NVIDIA/cccl#9167: Adds other __reduce_impl overloads and related tests; closely aligned with multi-overload reduce work.
NVIDIA/cccl#9154: Earlier changes to __reduce_impl overloads and reduce wrapper plumbing; directly related.
NVIDIA/cccl#9264: Similar conditionalization of lane-mask validation in group.cuh; related to group mapping changes.

Suggested labels

cudax

Suggested reviewers

andralex
miscco
pciolkosz

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

cudax/test/coop/reduce/warps_within_block.cu (1)

152-153: ⚡ Quick win

suggestion: The test names and tags still say this_warp on Lines 152 and 173, but this file/target is warps_within_block. Rename the C2H_TEST names/tags to warps_within_block to keep filtering and failure triage unambiguous.

Also applies to: 173-174

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b9ab43e4-f31a-4ae2-94bb-fdad331fffa3

📥 Commits

Reviewing files that changed from the base of the PR and between 2f7cb8b and 3aebfa2.

📒 Files selected for processing (5)

cudax/include/cuda/experimental/__coop/reduce.cuh
cudax/include/cuda/experimental/__group/queries.cuh
cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
cudax/test/CMakeLists.txt
cudax/test/coop/reduce/warps_within_block.cu

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a491f9e-17cf-4aba-a115-6d696e80c0f0

📥 Commits

Reviewing files that changed from the base of the PR and between 090635a and a7140fb.

📒 Files selected for processing (6)

cudax/include/cuda/experimental/__coop/reduce.cuh
cudax/include/cuda/experimental/__group/group.cuh
cudax/include/cuda/experimental/__group/queries.cuh
cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
cudax/test/CMakeLists.txt
cudax/test/coop/reduce/warps_within_block.cu

🚧 Files skipped from review as they are similar to previous changes (5)

cudax/include/cuda/experimental/__group/synchronizer/barrier_synchronizer.cuh
cudax/test/CMakeLists.txt
cudax/include/cuda/experimental/__group/queries.cuh
cudax/test/coop/reduce/warps_within_block.cu
cudax/include/cuda/experimental/__coop/reduce.cuh

github-actions · 2026-06-08T21:01:17Z

🥳 CI Workflow Results

🟩 Finished in 35m 02s: Pass: 100%/55 | Total: 8h 09m | Max: 35m 00s | Hits: 71%/46192

See results here.

davebayer requested a review from a team as a code owner June 4, 2026 12:32

davebayer requested a review from andralex June 4, 2026 12:32

github-project-automation Bot added this to CCCL Jun 4, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 4, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Comment thread cudax/test/coop/reduce/warps_within_block.cu

This comment has been minimized.

Sign in to view

davebayer force-pushed the cudax_coop_reduce_warp_groups branch from 3aebfa2 to 05ebea1 Compare June 5, 2026 11:47

This comment has been minimized.

Sign in to view

[cudax] Implement cudax::coop::reduce for warp groups within a block

a7140fb

davebayer force-pushed the cudax_coop_reduce_warp_groups branch from 090635a to a7140fb Compare June 8, 2026 20:24

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread cudax/include/cuda/experimental/__group/group.cuh

davebayer requested review from pciolkosz and removed request for andralex June 8, 2026 20:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cudax] Implement `cudax::coop::reduce` for warp groups within a block#9258

[cudax] Implement `cudax::coop::reduce` for warp groups within a block#9258
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:cudax_coop_reduce_warp_groups

davebayer commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading

Reviews paused

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

coderabbitai Bot left a comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

davebayer commented Jun 4, 2026

Uh oh!

coderabbitai Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Summary

Core Implementation Changes

Supporting Changes

Tests

API / ABI Surface Changes

Review Notes / Areas to Verify

Walkthrough

Changes

Assessment against linked issues

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 8, 2026

🥳 CI Workflow Results

🟩 Finished in 35m 02s: Pass: 100%/55 | Total: 8h 09m | Max: 35m 00s | Hits: 71%/46192

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 4, 2026 •

edited

Loading